Edit Distance Problem
Let's solve the Edit Distance problem using Dynamic Programming.
Statement#
Given two strings, str1 and str2, find the minimum edit distance required to convert str1 into str2. Minimum edit distance is the minimum number of insertions, deletions, or substitutions required to transform str1 into str2.
Note: This problem has a direct application in the autocorrect feature. It is also used in bioinformatics to quantify the similarity between two DNA sequences.
The visualization below shows all these operations. Each operation has a cost of one unit. Therefore, you want to find the minimum cost of converting str1 into str2. The following visualization also shows how different sequences of operations can entail different costs.
1 of 10
2 of 10
3 of 10
4 of 10
5 of 10
6 of 10
7 of 10
8 of 10
9 of 10
10 of 10
The alignment depiction in the above visualization is a hint toward the solution. You basically need to find an alignment between two strings that has the least cost. When characters match, there is no cost, but there is a cost of one unit when characters do not match (insertion, deletion, or substitution), or when a character is skipped (insertion or removal in str1) in either of the strings.
Constraints:
-
str1.length,str2.length str1andstr2are in the same case English letters
Examples#
No. | str1 | str2 | Edit Distance |
1 | sunday | saturday | 3 |
2 | sam | sum | 1 |
Try it yourself#
Implement your solution in the following playground.
Note: If you clicked the “Submit” button and the code timed out, this means that your solution needs to be optimized in terms of running time.
Hint: Use dynamic programming and see the magic.
Solution#
We will first explore the naive recursive solution to this problem and then see how it can be improved using the Longest Common Substring dynamic programming pattern.
Naive approach#
A naive approach is to process all the characters of each string, one by one. Starting from the last character of both strings, there are two possibilities for every pair of characters being considered:
-
The last characters of both the strings match, in which case, the lengths of both the strings are reduced by one, and the function is called recursively on the remaining strings.
-
The last characters of both the strings do not match, in which case, all three operations (insertion, deletion, and substitution) are carried out on the last character of the first string.
-
The minimum cost for all three operations is computed recursively and returned.
Let’s implement the algorithm as discussed above:
Note: Please observe that if you include the test case commented out in the driver code, the solution is likely to time out. Try to solve the larger problem using the dynamic programming solutions provided below and see the difference.
Time complexity#
The time complexity of the naive approach is exponential, i.e., , where is the length of the longest string.
Space complexity#
As the maximum depth of the recursive call tree is , and each call stores its data on the stack, the space complexity of this approach is , where is the length of the longest string.
Optimized Solution using dynamic programming#
Now, let’s improve the recursive solution using top-down and bottom-up approaches.
Top-down solution#
The top-down solution, commonly known as the memoization technique, is an enhancement of the recursive solution. It overcomes the problem of calculating redundant solutions over and over again by storing them in an array. In the naive approach, it computes the minimum edit distance of the same substrings many times. Therefore, we need to memoize by using a 2-D array, lookup_table. This array is initialized using the lengths of both strings in the min_edit_dist function.
The basic algorithm works exactly the same as the one in the naive approach. The only difference is that for each character, the minimum edit distance value is stored in the lookup_table.
-
If the end characters match, the value returned is stored at
lookup_table[m-1][n-1] -
If the end characters don’t match, all three operations (insertion, deletion, and substitution) are carried out on the last character of the first string. The minimum cost for all three operations is then computed and stored in the
lookup_table[m-1][n-1]
Realize that since the values returned are being stored in the lookup_table, the next time the function is called with the same arguments, the recursive call won’t be made, and instead the value will be directly returned from the lookup_table.
Let’s implement the algorithm as discussed above:
Note: Please observe that if you now uncomment the large test case in the driver code, the dynamic programming solution will still work.
Time complexity#
As we avoided redundant calculations by storing all the intermediate results in a 2-D array, the time complexity of this approach is reduced to , where is the length of the longest string.
Space complexity#
We now need space in memory to store intermediate values.
Bottom-up solution#
The bottom-up solution, also known as the tabulation technique, is an iterative method of solving dynamic programming problems. In this solution, we will again make use of a 2-D array, lookup_table, that stores the minimum edit distance required to convert the first string into the second string for each character.
In this case, we build the solution bottom-up by dividing our problem into subproblems.
-
We know that if the first string is empty, we perform the insert operation on all characters of the second string to make both strings similar. Therefore, the minimum number of operations will equal
j. Similarly, if the second string is empty, the delete operation is performed on all elements of the first string to make it similar to the second string. So, the minimum number of operations will equali. Since this is known, we fill thelookup_tableaccordingly. -
If the last characters are the same, we store the returned values from
lookup_table[i-1][j-1]inlookup_table[i][j]. -
For characters that don’t match, the algorithm works by performing all three operations on the last character of the first string and calculating the minimum edit distance required to convert the first string into the second string. This value is then stored in the
lookup_tableto avoid recalculations.
Let’s look at the following illustration to get a better understanding of the solution:
1 of 22
2 of 22
3 of 22
4 of 22
5 of 22
6 of 22
7 of 22
8 of 22
9 of 22
10 of 22
11 of 22
12 of 22
13 of 22
14 of 22
15 of 22
16 of 22
17 of 22
18 of 22
19 of 22
20 of 22
21 of 22
22 of 22
Let’s implement the algorithm as discussed above:
Note: Please observe that if you now uncomment the large test case in the driver code, the dynamic programming solution will still work.
Time complexity#
As we created a 2-D array to store the results of sub-problems that would be used later on, the time complexity of this approach will also be .
Space complexity#
We need space in memory to store the intermediate values.
Minimum Number of Deletions and Insertions
Longest Repeating Subsequence